Abstract




In this paper, we propose a number of adaptive prototype learning (APL) algorithms. They employ 
the same algorithmic scheme to determine the number and location of prototypes, but differ in the 
use of samples or the weighted averages of samples as prototypes, and also in the assumption of 
distance measures. To understand these algorithms from a theoretical viewpoint, we address their 
convergence properties, as well as their consistency under certain conditions. We also present a 
soft version of APL, in which a non-zero training error is allowed in order to enhance the 
generalization power of the resultant classifier. Applying the proposed algorithms to twelve UCI 
benchmark data sets, we demonstrate that they outperform many instance-based learning 
algorithms, the k-nearest neighbor rule, and support vector machines in terms of average test
accuracy. 

Keywords: adaptive prototype learning, cluster-based prototypes, consistency, instance-based 
prototype, pattern classification 


1 Introduction


We divide this section into two parts, with the first part addressing the background of all related 
methods and the second part discussing our contributions. 


1.1 Background


In pattern recognition, one method for classifying objects, expressed as feature vectors, is to
compute the distance between the vectors and certain labeled vectors, called prototypes. This 
approach selects the k nearest prototypes for each test object and classifies the object in terms of 
the labels of the prototypes and a voting mechanism. Prototypes are vectors that reside in the 
same vector space as feature vectors and can be derived from training samples in various ways. 
The simplest way is to use all training samples as prototypes (Fix and Hodges, 1951, 1952, 
1991a, 1991b). Besides not incurring any training costs, this approach has two major advantages. 
First, for a finite set of training samples S, the error rate using all samples as prototypes does not
exceed twice the Bayes risk (Cover and Hart, 1967). Second, it ensures consistency, or
asymptotic Bayes-risk efficiency (Stone, 1977; Devroye and Györfi, 1985; Zhao, 1987; Devroye
et al., 1994).
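
The classification scheme described above can be illustrated with a minimal sketch. The function name `classify`, the toy prototypes, and the use of Euclidean distance are assumptions made for illustration only; they are not taken from the paper.

```python
import math
from collections import Counter

def classify(x, prototypes, k=1):
    """Label x by majority vote among its k nearest prototypes.

    prototypes is a list of (vector, label) pairs (illustrative layout).
    """
    ranked = sorted(prototypes, key=lambda pl: math.dist(x, pl[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical prototypes: two of class "a", one of class "b".
prototypes = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"), ((1.0, 1.0), "b")]
print(classify((0.1, 0.2), prototypes, k=1))
print(classify((0.9, 0.8), prototypes, k=3))
```

With k = 1 the rule reduces to the 1-NN rule used in the theoretical sections; using all training samples as prototypes corresponds to the classical k-NN rule.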


However, recruiting all training samples as prototypes can incur a high computational cost during 
the test procedure, which is prohibitive in applications with large corpora. Consequently, certain 
editing rules have been proposed to reduce the number of prototypes. The condensed nearest 
neighbor (CNN) rule (Hart, 1968) was the first, and perhaps simplest, proposal among many 
subsequent ones, all of which try to extract a subset from a collection of samples. These 
algorithms execute a process iteratively to check the satisfaction of certain criteria for the current 
set of prototypes, and add or drop prototypes until a stop condition is met. Wilson and Martinez 
(2000) collected and compared many algorithms of this type (in particular, DROP1 to DROP5),
and categorized them as instance-based learning (IBL) algorithms. More recently, an alternative 
IBL algorithm called the Iterative Case Filtering (ICF) algorithm (Brighton and Mellish, 2002) 
was proposed. ICF runs faster than most IBL algorithms, which drop rather than add samples 
(this point is discussed further in Section 7.2), yet it achieves comparable accuracy to the latter 
algorithms. 

Another method for finding prototypes can be categorized as cluster-based learning (CBL) 
algorithms, in which prototypes are not samples per se, but can be derived as the weighted 
averages of samples. The k-means clustering algorithm (Lloyd, 1982; Max, 1960; Linde et al.,
1980), the fuzzy c-means algorithm (Bezdek, 1981; Höppner et al., 1999), and the learning vector
quantization algorithm (Kohonen, 1988, 1990) are examples of this method. Instead of 
representing prototypes as the weighted averages of samples, they can be represented as centroids 
of clusters (Devi and Murty, 2002), or as hyperrectangles (high-dimensional rectangles) 
(Salzberg, 1991). In the latter case, the distance between a sample and a hyperrectangle not 
containing the sample is defined as the Euclidean distance between the sample and the nearest 
face of the hyperrectangle. 

In their guidelines for the design of prototype learning algorithms, Devroye et al. (1996, 
Chapter 19) propose some sufficient conditions for the consistency of this kind of algorithm. The 
conditions stipulate that: (a) the algorithm should minimize the empirical error, which is the error 
in classifying training samples; and (b) the number of prototypes should grow as a lower order of 
the number of training samples. 

Support vector machines (SVM) can also be used for pattern classification. In this approach, 
objects are classified by maximizing the margins between samples with different labels, where 
the margin is defined as the gap between two parallel hyperplanes (Figure 1a). The consistency of 
SVM is assured if the samples are bounded and the margin between samples with different labels 
holds (Vapnik, 1995; Schölkopf et al., 1999; Cristianini and Shawe-Taylor, 2000).


1.2 Our Contributions

The requirement that data should be bounded is reasonable, since it is a common practice in 
applications to normalize feature values to a certain bounded interval (between 0 and 1, for 
example). The margin assumption, on the other hand, is unique to SVM. However, we can prove 
the consistency of CNN under a more relaxed assumption (Figure 1b). For convenience, we say 
that two labeled entities (that is, samples or prototypes) are homogeneous if they have the same 
label; otherwise, they are heterogeneous. We require a non-zero distance between heterogeneous 
samples. 


Despite its consistency, CNN could be improved in two ways. First, its criterion for prototype 
satisfaction is rather weak and could be strengthened. Second, it is not difficult to develop an 
alternative process by using a cluster-based rule to construct prototypes. Experiments show that 
the latter process often achieves better test accuracy than CNN. Another issue with this algorithm
is its theoretical standing. The consistency of CNN derives from the fact that its prototypes are 
samples and thus always keep a certain distance from each other. The cluster centers, on the other 
hand, are not samples but the weighted averages of samples, so it is difficult to control the 
distances between them. To resolve this problem, we adopt a hybrid solution that combines 
cluster centers and certain selected samples to maintain a desirable separation between all the 
resultant prototypes. 





Figure 1. (a) A margin exists between two data sets. (b) A positive distance exists between two
data sets.


Note that it is not always appropriate to minimize training errors for SVM. Sometimes, a 
higher number of training errors should be tolerated so that prediction errors can be reduced. Such 
flexibility, which is built into the “soft-margin” version of SVM (Cortes and Vapnik, 1995; 
Bartlett and Shawe-Taylor, 1999), yields better test accuracy than the “hard-margin” version.
Fortunately, this flexibility also exists in adaptive prototype learning (APL) algorithms, and can 
be derived by a tradeoff between the number of prototypes and their predictive power. However, 
although APL reduces training errors by adding prototypes, it increases the risk of overfitting. A 
balance between these two factors is made possible by a cross-validation study, similar to that 
used for SVM. We discuss this point further in Section 6. 

In summary, we propose two types of prototype learning algorithm. The first is an instance- 
based algorithm, which adds samples as prototypes according to an enhanced absorption criterion. 
The advantage of this approach (discussed in Section 7.2) is that it achieves substantially higher 
test accuracy at a relatively low training cost, compared to other instance-based algorithms, 
whose major merit is a lower ratio of prototypes to training samples. Although our algorithm 
achieves higher test accuracy at the expense of a somewhat higher ratio of prototypes to training 
samples, we believe this is acceptable, since it enables the proposed classifier to even outperform 
the k-nearest neighbor (k-NN) rule in terms of accuracy. The second approach is a hybrid method 
that constructs prototypes as either samples or the weighted averages of samples. Compared to 
SVM, the hybrid prototype learning method yields higher test accuracy, at the expense of a higher 
training cost (discussed in Section 7.3). 

The remainder of the paper is organized as follows. In the next section, we present the 
Vapnik-Chervonenkis (VC) theory of multiclass classification. In Section 3, we provide proof of 
the consistency of CNN under certain conditions. In Section 4, the extension of CNN to APL is 
discussed, along with the convergence of APL and its consistency under the same conditions. In 
Sections 5 and 6, respectively, we describe a kernelized version and a soft version of APL. 
Section 7 contains experimental studies of APL and comparisons with some instance-based 
learning algorithms, namely, k-NN, CNN and SVM. Finally, in Section 8, we present our
conclusions. 


2  Vapnik-Chervonenkis Theory of Multiclass Classification 


In this section, we develop a basic theory of prototype-learning algorithms. In particular, we 
derive an asymptotic result for generalization errors of prototype learning algorithms. For the case 
of binary classification, in which an object is classified as one of two class types, the standard 
Vapnik-Chervonenkis (VC) theory provides such a bound. This theory, however, is not sufficient 
for our purpose, since we deal with multiclass classifications in which we want to classify an 
object into one of m classes, with $m \geq 2$. Here, we focus on extending the standard VC theory to
such a case. 


The standard VC theory is a probabilistic theory that has great breadth and depth. To present 
a complete version of the theory in a journal paper is impossible. In fact, it is also unnecessary, 
since a comprehensive treatment can be found in the book A Probabilistic Theory of Pattern
Recognition (Devroye et al., 1996). For this reason, we follow its notations closely (with some 
minor changes to suit our purpose) and quote those theorems that are relevant to our task. 

We assume there are n training samples $(x_1, y_1), \ldots, (x_n, y_n)$ and a test sample $(x, y)$ drawn
independently from the set $\mathbb{R}^d \times \Lambda$ according to the same distribution, where
$\Lambda = \{1, 2, \ldots, m\}$ is a set of labels or class types. Then, for a classifier
$g: \mathbb{R}^d \to \Lambda$, we define its training error $L_n(g)$ and testing error $L(g)$ as follows.

Definition 1 The training error of a classifier g is defined as the fraction of training samples
misclassified by g, that is, $L_n(g) = (1/n) \sum_{i=1}^{n} I_{\{g(x_i) \neq y_i\}}$, where I is the
indicator function such that $I_{\{g(x_i) \neq y_i\}} = 1$ if and only if $g(x_i) \neq y_i$. The testing
error of a classifier g is defined as the probability that a test sample has been misclassified by g,
that is, $L(g) = \Pr\{g(x) \neq y\}$.

Typically, from the training samples, a learning algorithm tries to build a classifier g of a 
generic class C, with the objective that g can generalize well in the sense that it has a small testing 
error. The standard VC theory provides a bound for the testing error of binary classifiers. This 
bound can be expressed in terms of the following complexity measure of C. 


Definition 2 Let C be a collection of binary classifiers of the form $g: \mathbb{R}^d \to \{0, 1\}$. For
any n, the n-th shatter coefficient of C is defined as

$$S(C, n) = \max_{T \subset \mathbb{R}^d,\, |T| = n} |\{g_T : g \in C\}|,$$

where $g_T$ is the function obtained by restricting g to the domain T, and |X| for any set X is the
number of elements of X.
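
As a small illustration of Definition 2, the sketch below counts the distinct restrictions $g_T$ for a toy family that is not discussed in the paper: one-dimensional threshold classifiers $g_t(x) = 1$ if $x > t$, evaluated on a hypothetical four-point set T. The helper names are assumptions made for this example.

```python
def restrictions(points, classifiers):
    """Return the set of distinct labelings that the classifiers induce on points."""
    return {tuple(g(x) for x in points) for g in classifiers}

def threshold(t):
    """Binary classifier on the real line: 1 if x > t, else 0."""
    return lambda x: 1 if x > t else 0

points = [0.5, 1.5, 2.5, 3.5]                   # a 4-element subset T of R
family = [threshold(t) for t in range(0, 5)]    # one threshold in each gap of T
count = len(restrictions(points, family))
print(count, "labelings out of", 2 ** len(points), "possible")
```

For this family, n points can be labeled in only n + 1 ways, far fewer than the $2^n$ labelings available to an unrestricted class; it is this gap that makes the bound in the next theorem useful.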

Intuitively, the n-th shatter coefficient of C is the maximum number of ways that an n-element
subset can be partitioned by the classifiers in C. The following well-known result of Vapnik and 
Chervonenkis (1971, 1974a, 1974b) provides a bound for the testing error of classifiers in C, 
which we denote as the VC-bound. We adopt this result from Theorem 12.6 in Devroye et al. 
(1996). 


Theorem 3 Let C be a collection of binary classifiers. Then, for any n and any $\varepsilon > 0$,

$$\Pr\Big\{\sup_{g \in C} |L_n(g) - L(g)| > \varepsilon\Big\} \leq 8 S(C, n) e^{-n\varepsilon^2/32}.$$
Next, we explain how to obtain an analogous result for multiclass classifiers. For a collection
C of multiclass classifiers $g: \mathbb{R}^d \to \Lambda$, let $C^B$ be the class of binary classifiers
$g^i: \mathbb{R}^d \to \{0, 1\}$ such that $g^i(x) = 1$ if and only if $g(x) = i$, for $g \in C$ and
$i = 1, 2, \ldots, m$.


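Theorem 4 Let C be a collection of multiclass classifiers of the form $g: \mathbb{R}^d \to \Lambda$.
Then, for any n and any $\varepsilon > 0$,

$$\Pr\Big\{\sup_{g \in C} |L_n(g) - L(g)| > \varepsilon\Big\} \leq 8 S(C^B, n) e^{-n\varepsilon^2/(8m^2)}.$$

Proof: First, consider any classifier $g: \mathbb{R}^d \to \Lambda$. A sample
$(x, y) \in \mathbb{R}^d \times \Lambda$ is misclassified by g if and only if it is misclassified by both
$g^{y}$ and $g^{g(x)}$. Thus, $2L_n(g) = \sum_{i=1}^{m} L_n(g^i)$ and $2L(g) = \sum_{i=1}^{m} L(g^i)$.
Then, by the triangle inequality, the condition $|L_n(g) - L(g)| > \varepsilon$ implies that

$$\sum_{i=1}^{m} |L_n(g^i) - L(g^i)| \geq \Big|\sum_{i=1}^{m} L_n(g^i) - \sum_{i=1}^{m} L(g^i)\Big| = |2L_n(g) - 2L(g)| > 2\varepsilon,$$

and $|L_n(g^i) - L(g^i)| > 2\varepsilon/m$ for some i. Therefore,

$$\Pr\Big\{\sup_{g \in C} |L_n(g) - L(g)| > \varepsilon\Big\} \leq \Pr\Big\{\sup_{g^i \in C^B} |L_n(g^i) - L(g^i)| > 2\varepsilon/m\Big\}.$$

This theorem follows from the previous theorem with $\varepsilon$ replaced by $2\varepsilon/m$. □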
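
The counting identity used in the proof, $2L_n(g) = \sum_{i} L_n(g^i)$, can be checked numerically. The sketch below uses hypothetical predictions and labels; each binary classifier $g^i$ is scored against the binary target "y = i", which is the reading of the proof assumed here.

```python
def multiclass_error(pred, truth):
    """Fraction of samples with g(x) != y."""
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)

def binary_error(pred, truth, i):
    """Error of g^i (1 iff g(x) = i) against the binary target (1 iff y = i)."""
    return sum((p == i) != (t == i) for p, t in zip(pred, truth)) / len(truth)

# Hypothetical outputs of a 3-class classifier and the true labels.
pred = [1, 2, 3, 1, 2, 3]
truth = [1, 2, 1, 3, 2, 2]
m = 3

lhs = 2 * multiclass_error(pred, truth)
rhs = sum(binary_error(pred, truth, i) for i in range(1, m + 1))
print(lhs, rhs)
```

Each multiclass mistake is counted exactly twice on the right-hand side: once by the binary classifier of the predicted class and once by that of the true class.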


As stated in the above theorem, the VC-bound is the product of two terms. The first is just the
shatter coefficient, whose magnitude depends on the collection of classifiers C. The second term
decays exponentially to zero as $n \to \infty$. To obtain an asymptotic result from this product, we
need to know how fast the shatter coefficient grows as $n \to \infty$. If its growth is slower than the
decay of the second term, then the VC-bound approaches zero as $n \to \infty$.

Let us now define some terms. A prototype data pair (p, y) consists of a prototype
$p \in \mathbb{R}^d$ and its label y. We say that a classifier g uses the 1-NN rule based on prototype
data pairs if g assigns to each $x \in \mathbb{R}^d$ the label of the nearest prototype to x. The
collection of all multiclass classifiers using the 1-NN rule based on k prototype data pairs is denoted
by $C_{(k)}$, while the collection of binary classifiers using the 1-NN rule based on k prototype data
pairs is denoted by $B_{(k)}$. We want to derive a result for $C_{(k)}$ in terms of a known result for
$B_{(k)}$. To do this, we adopt the following lemma from Devroye et al. (1996, p. 305), which provides a
bound for $S(B_{(k)}, n)$.


Lemma 5 $S(B_{(k)}, n) \leq (ne/(d+1))^{(d+1)(k-1)}$.

From Theorem 4 and Lemma 5, we derive the following result for $C_{(k)}$.


Theorem 6 For any n and any $\varepsilon > 0$,

$$\Pr\Big\{\sup_{g \in C_{(k)}} |L_n(g) - L(g)| > \varepsilon\Big\} \leq 8 (ne/(d+1))^{(d+1)(k-1)} e^{-n\varepsilon^2/(8m^2)}.$$


Proof: In order to apply Theorem 4, we need to find a bound for $S(C^B_{(k)}, n)$, where $C^B_{(k)}$
derives from $C_{(k)}$ in the same way as $C^B$ derives from C. Since $C^B_{(k)} \subseteq B_{(k)}$, we
have $S(C^B_{(k)}, n) \leq S(B_{(k)}, n)$, which follows easily from the definition of the shatter
coefficient. Therefore, by Lemma 5, we have $S(C^B_{(k)}, n) \leq (ne/(d+1))^{(d+1)(k-1)}$. Combining
this inequality and Theorem 4, we obtain the desired result. □
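
To see how the two factors of the bound compete, the following sketch evaluates the logarithm of a bound of the form $8 (ne/(d+1))^{(d+1)(k-1)} e^{-n\varepsilon^2/(8m^2)}$. The exponent $(d+1)(k-1)$ is an assumption about the shatter-coefficient bound, and the values of k, d, m, and $\varepsilon$ are hypothetical.

```python
import math

def log_vc_bound(n, k, d, m, eps):
    """Natural log of 8 * (n e / (d+1))^((d+1)(k-1)) * exp(-n eps^2 / (8 m^2)).

    The polynomial exponent (d+1)(k-1) is an assumed form of the shatter bound.
    """
    log_shatter = (d + 1) * (k - 1) * math.log(n * math.e / (d + 1))
    log_decay = -n * eps ** 2 / (8 * m ** 2)
    return math.log(8) + log_shatter + log_decay

# Hypothetical setting: k = 20 prototypes, d = 5 features, m = 3 classes.
for n in (10 ** 4, 10 ** 6, 10 ** 8):
    print(n, round(log_vc_bound(n, k=20, d=5, m=3, eps=0.1), 1))
```

For a fixed k, the shatter term grows only logarithmically in n (on the log scale) while the decay term grows linearly, so the bound is vacuous for small n and eventually tends to zero.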


For a given sequence of training data $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, a classification rule
is a sequence of classifiers $\{g_n\}$ such that $g_n$ is built on $D_n$. Such a rule is said to be
consistent, or asymptotically Bayes-risk efficient, if

$$\lim_{n \to \infty} \Pr\{L(g_n) - \inf_g L(g) > \varepsilon\} = 0,$$

for any $\varepsilon > 0$, where $\inf_g L(g)$ is the infimum (i.e., the greatest lower bound) of the
testing errors of all classifiers of the form $g: \mathbb{R}^d \to \Lambda$. A prototype classification
rule, on the other hand, is a sequence $\{g_n\}$ such that $g_n$ uses the 1-NN rule based on $k_n$
prototypes for some $k_n$. The following corollary provides a sufficient condition for the consistency
of a prototype classification rule. Note that, in stating the corollary, we use $o(f(n))$ to denote a
quantity whose ratio to $f(n)$ approaches zero as $n \to \infty$.


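Corollary 7 Suppose that $\{g_n\}$ is a prototype classification rule such that $L_n(g_n) = 0$ and
$k_n = o(n\varepsilon^2/(m^2 d \log n))$ for all n. Then, for any $\varepsilon > 0$,

$$\lim_{n \to \infty} \Pr\{L(g_n) - \inf_g L(g) > \varepsilon\} = \lim_{n \to \infty} \Pr\{L(g_n) > \varepsilon\} = 0.$$

Proof: Since $L_n(g_n) = 0$, the condition that $L(g_n) > \varepsilon$ implies that
$|L_n(g_n) - L(g_n)| > \varepsilon$ and thus $\sup_{g \in C_{(k_n)}} |L_n(g) - L(g)| > \varepsilon$ as well.
Hence, $\Pr\{L(g_n) > \varepsilon\} \leq \Pr\{\sup_{g \in C_{(k_n)}} |L_n(g) - L(g)| > \varepsilon\}$.
Also, since $k_n = o(n\varepsilon^2/(m^2 d \log n))$, by Theorem 6, we have
$\Pr\{L(g_n) > \varepsilon\} \to 0$ as $n \to \infty$. Finally, $L(g_n) - \inf_g L(g) > \varepsilon$ implies
that $L(g_n) > \varepsilon$, so the probability of the former inequality also approaches zero as
$n \to \infty$. □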


3 The Condensed Nearest Neighbor Rule 

Following the notations defined in Section 2, we assume that a set of observed data, or samples,
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ is given. Our goal here is to extract a subset $U_n$ from
$X_n = \{x_i\}_{i=1}^{n}$ in such a way that if u is the nearest member of $U_n$ to $x_i$, then
$l(u) = y_i$, where $l(u)$ is the label of u. Members of $U_n$ are called prototypes, and samples whose
labels match those of their nearest prototypes are said to be absorbed.

The CNN rule (Hart, 1968) is a simple way of solving the above problem. Starting with
$U_n = \{x_0\}$, where $x_0$ is randomly chosen from $X_n$, CNN scans all members of $X_n$. It then adds
to $U_n$ a member x of $X_n$ whose nearest prototype's label does not match that of x. The algorithm
scans $X_n$ as many times as necessary, until all members of $X_n$ have been absorbed or, equivalently,
no more prototypes can be added to $U_n$.
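
The scanning procedure can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it assumes Euclidean distance, assumes that no two samples with different labels coincide, and uses a hypothetical toy data set; the names `cnn` and `nearest_label` are introduced here for the example.

```python
import math
import random

def nearest_label(x, prototypes):
    """Label of the prototype closest to x."""
    return min(prototypes, key=lambda pl: math.dist(x, pl[0]))[1]

def cnn(samples, seed=0):
    """Add any sample whose nearest prototype carries a different label,
    rescanning until a full pass adds nothing (all samples absorbed)."""
    rng = random.Random(seed)
    prototypes = [rng.choice(samples)]      # random initial prototype x_0
    changed = True
    while changed:
        changed = False
        for x, y in samples:
            if nearest_label(x, prototypes) != y:
                prototypes.append((x, y))
                changed = True
    return prototypes

# Hypothetical two-class data with well-separated groups.
samples = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
           ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
protos = cnn(samples)
print(len(protos), "prototypes")
```

On separated data such as this, one prototype per class suffices; the number of prototypes grows only where the classes lie close to each other.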

Let $\delta_n = \min\{\|x_i - x_j\| : x_i, x_j \in X_n \text{ and } l(x_i) \neq l(x_j)\}$, that is,
$\delta_n$ is the minimal distance between heterogeneous samples. Since $\{\delta_n\}$ is a decreasing
sequence, there exists a $\delta$ such that $\delta_n \to \delta$ as $n \to \infty$. The consistency of
the CNN rule can be proved under the following two conditions. 1) Boundedness: all samples are included
in a bounded set; that is, there exists a region H of radius R such that $\Pr\{x \in H\} = 1$.
2) Non-zero separation: the limit $\delta$ of $\{\delta_n\}$ is non-zero.


Lemma 8 Under the conditions of boundedness and non-zero separation, the number of CNN
prototypes cannot exceed $(2R/\delta + 1)^d$, where R is the radius of H and $\delta$ is the limit of
$\{\delta_n\}$.

Proof: We want to prove that all prototypes are $\delta$-separated, that is, the distance between any
two of them is at least $\delta$. This is true for any two prototypes with different labels, since all
prototypes are samples and heterogeneous samples are $\delta$-separated. Therefore, we only have to
prove that all prototypes of the same label are also $\delta$-separated.


We assume that p and q are prototypes of the same label. As CNN is a sequential process, its
prototypes are constructed in linear order. Without loss of generality, we assume that p is
constructed before q; hence, at the time q is added, there must be a prototype m, constructed before
q, such that $l(m) \neq l(q)$ and $\|q - p\| \geq \|q - m\|$. Now, since q and m have different labels,
$\|q - m\| \geq \delta$. Combining these two facts, we obtain $\|q - p\| \geq \delta$.

We define a ball of radius r centered at w as $B(w, r) = \{x : \|x - w\| < r\}$. Let the prototypes be
$\{p_i\}_{i=1}^{k}$. Since they are $\delta$-separated from each other, all the balls
$B(p_i, \delta/2)$ are non-overlapping. Moreover, the union of these balls is contained in a ball of
radius $R + \delta/2$ (Figure 2). So, we must have $k(\delta/2)^d \leq (R + \delta/2)^d$, or
$k \leq (2R/\delta + 1)^d$. □
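
The quantities in this argument are easy to compute for a concrete data set. The sketch below evaluates the minimal heterogeneous distance and the packing bound $(2R/\delta + 1)^d$; the sample points and the helper names are hypothetical, and Euclidean distance is assumed.

```python
import math
from itertools import combinations

def min_hetero_distance(samples):
    """Smallest distance between two samples that carry different labels."""
    return min(math.dist(a, b)
               for (a, la), (b, lb) in combinations(samples, 2) if la != lb)

def cnn_prototype_bound(R, delta, d):
    """Packing bound (2R/delta + 1)^d on the number of delta-separated points
    inside a ball of radius R in d dimensions."""
    return (2 * R / delta + 1) ** d

# Hypothetical labeled samples in the plane, all within distance 5 of the origin.
samples = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((3.0, 4.0), 1)]
delta = min_hetero_distance(samples)
print(delta, cnn_prototype_bound(R=5.0, delta=delta, d=2))
```

The bound depends on n only through the separation, which is why a non-zero limit of the separation yields a prototype count that stays bounded as n grows.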








Figure 2. If all samples are contained in a ball of radius R, then all balls of radius $\delta/2$
centered at a sample are included in a ball of radius $R + \delta/2$.


From this lemma and the corollary to Theorem 6, we derive the following. 


Theorem 9 Let $\{g_n\}$ be a sequence of classifiers using the 1-NN rule based on CNN prototype
data pairs. The boundedness and non-zero separation conditions ensure the consistency of $\{g_n\}$.


4 Adaptive Prototype Learning Algorithms 


An adaptive prototype learning algorithm is similar to CNN in that it adds as many prototypes as 
necessary until all samples have been absorbed. APL, however, differs from CNN in two 
respects: the absorption criterion and the nature of prototypes. In CNN, all prototypes are 
samples, whereas prototypes in APL can be samples or the weighted averages of samples. We 
denote prototypes that are samples as instance-based prototypes (IBPs) to differentiate them from 
cluster-based prototypes (CBPs), which are the weighted averages of samples. First, we develop a 
special type of APL algorithm for IBPs and prove its consistency under the conditions that ensure 
the consistency of CNN. We then propose a more complex type of APL that combines IBPs and 
CBPs, after which we address APL’s convergence and consistency properties. 
4.1 Generalized CNN


The instance-based APL includes CNN as a special case. For this reason, we denote it as a 
generalized CNN, or GCNN. The difference is that GCNN employs a strong absorption criterion, 
in contrast to the weak criterion employed by CNN. According to CNN, a sample x is absorbed if 


$\|x - q\| - \|x - p\| > 0$, (1)
where p and q are prototypes, p is the nearest homogeneous prototype to x, and q is the nearest 
heterogeneous prototype to x. For GCNN, however, we adopt the following criterion: 


$\|x - q\| - \|x - p\| > \rho\delta_n$, $\rho \in [0, 1)$. (2)
We say that a sample is weakly absorbed if it satisfies (1), and strongly absorbed if it satisfies
(2). Note that (1) corresponds to the case where $\rho = 0$ in (2). Adopting (2) makes it possible to
improve the classifier by optimizing $\rho$. The question of how to optimize $\rho$ is addressed in
Section 6. 
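
The two absorption criteria differ only in the threshold applied to the same margin, as the following minimal sketch shows. The function names, the toy prototypes, and the value used for the separation are assumptions made for illustration; Euclidean distance is assumed, and every label is assumed to have at least one prototype.

```python
import math

def absorption_margin(x, y, prototypes):
    """||x - q|| - ||x - p||, with p the nearest homogeneous prototype
    and q the nearest heterogeneous prototype."""
    same = min(math.dist(x, p) for p, lab in prototypes if lab == y)
    other = min(math.dist(x, p) for p, lab in prototypes if lab != y)
    return other - same

def is_absorbed(x, y, prototypes, rho=0.0, delta=0.0):
    """Strong criterion (2); rho = 0 recovers the weak criterion (1)."""
    return absorption_margin(x, y, prototypes) > rho * delta

# Hypothetical prototypes, one per class, at distance 4 from each other.
prototypes = [((0.0, 0.0), "a"), ((4.0, 0.0), "b")]
x, y = (1.5, 0.0), "a"
print(is_absorbed(x, y, prototypes))                       # weak criterion
print(is_absorbed(x, y, prototypes, rho=0.5, delta=4.0))   # strong criterion
```

A sample near the class boundary can therefore be weakly absorbed yet fail the strong criterion, which is what triggers the additional prototypes in GCNN.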
We now describe the steps of GCNN. 
G1 Initiation: For each label y, select a y-sample as an initial y-prototype.


G2 Absorption Check: Check whether each sample is strongly absorbed (absorbed, for 
short). If all samples are absorbed, terminate the process; otherwise, proceed to the next 
step. 

G3 Prototype Augmentation: For each y, if any unabsorbed y-samples exist, select one as a 
new y-prototype; otherwise, no new prototype is added to label y. Return to G2 to 
proceed. 

In G1, a y-sample is selected as follows. We let each y-sample cast a vote for its nearest
y-sample, and select the one that receives the highest number of votes. In G3, an unabsorbed
y-sample is selected as follows. Let $\Psi_y = \{x_i : l(x_i) = y \text{ and } x_i \text{ is unabsorbed}\}$.
We let each member of $\Psi_y$ cast a vote for the nearest member in this set. The selected y-sample is
the member of $\Psi_y$ that receives the highest number of votes.
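
Steps G1 to G3, together with the voting rule, can be sketched as follows. This is an illustrative reading of the steps, not the authors' implementation: it assumes Euclidean distance, at least two labels, that "nearest y-sample" means the nearest other sample of the same label, and that ties are broken by list order. The data set is hypothetical.

```python
import math
from itertools import combinations

def vote_winner(points):
    """Each point votes for its nearest other point; return the most-voted point."""
    if len(points) == 1:
        return points[0]
    votes = [0] * len(points)
    for i, x in enumerate(points):
        j = min((j for j in range(len(points)) if j != i),
                key=lambda j: math.dist(x, points[j]))
        votes[j] += 1
    return points[max(range(len(points)), key=votes.__getitem__)]

def gcnn(samples, rho):
    labels = sorted({y for _, y in samples})
    delta = min(math.dist(a, b)
                for (a, la), (b, lb) in combinations(samples, 2) if la != lb)
    # G1: one voted seed per label.
    protos = [(vote_winner([x for x, y in samples if y == lab]), lab) for lab in labels]

    def absorbed(x, y):
        same = min(math.dist(x, p) for p, lab in protos if lab == y)
        other = min(math.dist(x, p) for p, lab in protos if lab != y)
        return other - same > rho * delta      # strong criterion (2)

    while True:
        # G2: collect the unabsorbed samples of every label.
        pending = {lab: [x for x, y in samples if y == lab and not absorbed(x, y)]
                   for lab in labels}
        if not any(pending.values()):
            return protos
        # G3: add one voted unabsorbed sample for each label that still has some.
        for lab, pts in pending.items():
            if pts:
                protos.append((vote_winner(pts), lab))

samples = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
           ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(len(gcnn(samples, rho=0.5)), len(gcnn(samples, rho=0.9)))
```

Raising the parameter of the strong criterion makes absorption harder, so the sketch returns more prototypes for the larger value.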


Lemma 10 GCNN prototypes satisfy the following properties. (a) For each prototype p, no
heterogeneous sample can be found in $B(p, \delta_n)$. (b) For any two heterogeneous prototypes p and
q, $\|p - q\| \geq \delta_n$. (c) For any two homogeneous prototypes m and n,
$\|m - n\| \geq (1 - \rho)\delta_n$.

Proof: Propositions (a) and (b) follow from the fact that GCNN prototypes are samples and the
separation between any two heterogeneous samples is at least $\delta_n$. To prove (c), let two
homogeneous prototypes m and n be given, and let m be constructed before n. Since n is not
absorbed by the time it is taken as a prototype, there exists a heterogeneous prototype q such that
$\|n - q\| - \|n - m\| \leq \rho\delta_n$ or, equivalently,
$\|n - m\| \geq \|n - q\| - \rho\delta_n$. (3)
Since n and q are heterogeneous, by (a), n cannot lie in $B(q, \delta_n)$. Thus,
$\|n - q\| \geq \delta_n$. (4)
Combining (3) and (4), we obtain $\|n - m\| \geq \delta_n - \rho\delta_n = (1 - \rho)\delta_n$. □





We define the number of iterations as the number of times G2 has been executed. The
following lemma states that the number of iterations cannot exceed a certain magnitude. Let $R_n$
be the radius of the smallest ball containing all samples in $X_n$.
Lemma 11 The number of GCNN prototypes cannot exceed $[(2R_n + \delta_n)/((1 - \rho)\delta_n)]^d$.
Moreover, GCNN converges within a finite number of iterations.


Proof: Lemma 10 ensures that homogeneous and heterogeneous GCNN prototypes are separated
by certain constants. Using this fact and a similar argument to that in the proof of Lemma 8, we
conclude that the number of GCNN prototypes cannot exceed
$[(2R_n + \delta_n)/((1 - \rho)\delta_n)]^d$. Since at least one prototype is created at each iteration,
the number of iterations cannot exceed this number either. □





Now, under the conditions of boundedness and non-zero separation, we can also show that
the number of GCNN prototypes is bounded from above by
$[(2R + \delta_n)/((1 - \rho)\delta_n)]^d$, with R replacing $R_n$. Since
$[(2R + \delta_n)/((1 - \rho)\delta_n)]^d \leq [(2R + \delta)/((1 - \rho)\delta)]^d$, the number of GCNN
prototypes is bounded from above by a constant independent of n. The consistency of GCNN
follows from the same argument that demonstrates the consistency of CNN.


Theorem 12 Under the conditions of boundedness and non-zero separation, GCNN is consistent. 


4.2 Linear Adaptive Prototype Learning 

Having explained GCNN, we are ready to describe a more complex type of APL that can take a 
mixture of IBPs and CBPs as its prototypes. To differentiate it from GCNN, and from another 
version of APL to be described later, we denote this algorithm as linear APL (LAPL). 

Recall that the consistency of GCNN derives from the separation of prototypes. We wish to 
obtain a similar separation between LAPL prototypes, but the addition of CBPs raises some 
problems. 

The first problem is the separation required for heterogeneous prototypes. While a
$\delta_n$-separation can be easily maintained by any two heterogeneous IBPs, it may not be maintained
so easily by two heterogeneous CBPs. Therefore, we require the separation to be $f\delta_n$, where
$f \in [0, 1]$. How we determine the optimal value of f is discussed in Section 6.

The next problem is the absorption criterion. For LAPL, we adopt the following: 


$\|x - q\| - \|x - p\| > \rho f \delta_n$, for $\rho \in [0, 1)$. (5)
How to optimize $\rho$ is also addressed in Section 6.


The third problem is how to maintain a positive separation between all LAPL prototypes. To 
achieve this objective, we specify the following requirements. 
(C1) For each prototype p, no heterogeneous sample exists in $B(p, f\delta_n)$.
(C2) For any two heterogeneous prototypes p and q, $\|p - q\| \geq f\delta_n$.
(C3) For any two homogeneous prototypes m and n, $\|m - n\| \geq (1 - \rho) f\delta_n$.
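
A direct check of requirements (C1) to (C3) might look like the sketch below. It assumes Euclidean distance, treats the ball in (C1) as open, reads "separated" as "at distance at least", and adds a small floating-point tolerance; the function name and the toy data are hypothetical.

```python
import math

def satisfies_separation(prototypes, samples, f, rho, delta):
    """Check (C1)-(C3) for labeled prototypes against labeled samples."""
    tol = 1e-12  # tolerance for floating-point comparisons (an assumption)
    for p, lp in prototypes:
        # (C1): no heterogeneous sample strictly inside B(p, f * delta).
        if any(ls != lp and math.dist(p, x) < f * delta - tol for x, ls in samples):
            return False
    for i, (p, lp) in enumerate(prototypes):
        for q, lq in prototypes[i + 1:]:
            # (C2) applies to heterogeneous pairs, (C3) to homogeneous pairs.
            gap = f * delta if lp != lq else (1 - rho) * f * delta
            if math.dist(p, q) < gap - tol:
                return False
    return True

# Hypothetical data: two samples of different labels at distance 2.
samples = [((0.0, 0.0), "a"), ((2.0, 0.0), "b")]
ok = satisfies_separation([((0.0, 0.0), "a"), ((2.0, 0.0), "b")], samples, 1.0, 0.5, 2.0)
bad = satisfies_separation([((1.5, 0.0), "a"), ((2.0, 0.0), "b")], samples, 1.0, 0.5, 2.0)
print(ok, bad)
```

Instance-based prototypes pass such a check automatically when f is at most 1, whereas a cluster center that drifts toward the other class, as in the second call, fails it.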


Thus, in the transition from GCNN to LAPL, we have systematically changed $\delta_n$ to $f\delta_n$. We


now state the LAPL algorithm, and prove that the prototypes derived from it satisfy (C1), (C2), 
and (C3). We first describe the general scheme of the algorithm, and then provide the technical 
details. The steps of LAPL are: 
H1 Initiation: For each label y, initiate a y-prototype as the average of all y-samples. If this
prototype does not satisfy (C1), (C2), and (C3), we apply the prototype adjustment 
module (described later in this section). 


H2 Absorption Check: Check whether each sample has been absorbed. If all samples have 
been absorbed, terminate the process; otherwise, proceed to the next step. 
H3 Prototype Refreshment: For each un-satiated label y (i.e., some y-samples are
unabsorbed), select an unabsorbed y-sample. We then apply a clustering algorithm to 
construct clusters, using the selected y-sample and all existing y-prototypes as seeds. 
The centers of the resultant clusters are new y-prototypes. If these prototypes do not 
satisfy (C1), (C2), and (C3), we apply the prototype adjustment module (described later 
in this section). Return to H2 to proceed. 


We now provide the technical details. 

Selection of Unabsorbed Samples in H1 and H3. The selection procedures in H1 and H3 
are the same as those in G1 and G3. 

Clustering Algorithms in H3. Any clustering algorithm can be used. For the experiment 
described in this paper, we use the k-means (KM) and the fuzzy c-means (FCM) clustering 
algorithms, both of which are applied to training samples of the same label. Thus, if there are m 
labels in the training data, we apply the algorithms m times. Details of the methods are as follows. 

The KM method (Lloyd, 1982; Max, 1960; Linde et al., 1980) derives a locally optimal
solution to the problem of finding a set of cluster centers $\{c_i\}_{i=1}^{p}$ that minimizes the
objective function

$$\sum_{j=1}^{n} \min_{i=1,\ldots,p} \|c_i - x_j\|^2. \qquad (6)$$

KM’s iterative process is performed as follows. Setting seeds as the initial cluster centers, we
add each sample to the cluster whose center is nearest to it. We then reset the center of each
cluster as the average of all the samples that fall in that cluster. To ensure rapid convergence of
KM, we require that the process stops when the number of iterations reaches 30, or the
membership of the clusters remains unchanged after the previous iteration.

In FCM (Bezdek, 1981; Höppner et al., 1999), the objective function to be minimized is

$$\sum_{i=1}^{p} \sum_{j=1}^{n} u_{ij}^{m} \|c_i - x_j\|^2, \quad \text{for } m \in (1, \infty), \qquad (7)$$

under the constraint

$$\sum_{i=1}^{p} u_{ij} = 1, \quad \text{for } j = 1, 2, \ldots, n, \qquad (8)$$

where $u_{ij}$ is the membership grade of sample $x_j$ to prototype $c_i$. Using the Lagrangian method,
we can derive the following equations:

$$u_{ij} = \frac{(1/\|c_i - x_j\|^2)^{1/(m-1)}}{\sum_{k=1}^{p} (1/\|c_k - x_j\|^2)^{1/(m-1)}}, \qquad (9)$$

$$c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}}, \qquad (10)$$

for i = 1, 2, ..., p, and j = 1, 2, ..., n respectively. FCM is a numerical method that finds a locally
optimal solution for (9) and (10). Using a set of seeds as the initial solution for
$\{c_i\}_{i=1}^{p}$, the algorithm computes $\{u_{ij}\}$ and $\{c_i\}_{i=1}^{p}$ iteratively. To ensure
rapid convergence of FCM, we require that the process stops when the number of iterations reaches 30,
or $\sum_{i=1}^{p} \|c_i^{old} - c_i^{new}\| = 0$.
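
The two clustering procedures can be sketched as follows: a seeded k-means loop with the 30-iteration cap described above, and a single FCM pass that applies the standard membership and center updates. This is an illustrative sketch assuming Euclidean distance; the function names, the guard against zero distances, the handling of empty clusters, and the toy samples are assumptions, not details given in the paper.

```python
import math

def mean(points):
    """Component-wise average of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def seeded_kmeans(samples, seeds, max_iter=30):
    """Lloyd iterations from the given seeds; stop after max_iter passes
    or when cluster membership no longer changes."""
    centers, prev = list(seeds), None
    for _ in range(max_iter):
        member = [min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
                  for x in samples]
        if member == prev:
            break
        prev = member
        for i in range(len(centers)):
            cluster = [x for x, a in zip(samples, member) if a == i]
            if cluster:                      # keep the old center if a cluster empties
                centers[i] = mean(cluster)
    return centers

def fcm_step(samples, centers, m=2.0):
    """One FCM pass: membership grades, then membership-weighted centers."""
    u = []
    for x in samples:
        d2 = [max(math.dist(x, c) ** 2, 1e-12) for c in centers]  # avoid division by zero
        w = [d ** (-1.0 / (m - 1.0)) for d in d2]
        total = sum(w)
        u.append([wi / total for wi in w])   # grades of one sample sum to 1
    new_centers = []
    for i in range(len(centers)):
        weights = [u[j][i] ** m for j in range(len(samples))]
        s = sum(weights)
        new_centers.append(tuple(sum(wt * x[k] for wt, x in zip(weights, samples)) / s
                                 for k in range(len(samples[0]))))
    return u, new_centers

# Hypothetical samples of one label, with an existing prototype and a new seed.
samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = seeded_kmeans(samples, seeds=[(0.0, 0.0), (10.0, 0.0)])
u, fuzzy_centers = fcm_step(samples, centers)
print(centers, fuzzy_centers)
```

In LAPL the seeds would be the existing prototypes of a label plus one selected unabsorbed sample, and the routine would be run once per label.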


The prototype adjustment module is used to adjust the location of prototypes if they do not 
satisfy the separation conditions (C1), (C2), and (C3). 
Prototype Adjustment in H1. The purpose of this module is to replace prototypes that
violate (C1) or (C2) with those that do not. Note that there is only one prototype per label in H1, 
so we do not need to worry about (C3). There are two steps in this stage. 

Step 1: If we find a CBP p that violates (C1), which requires that no heterogeneous sample
exists in $B(p, f\delta_n)$, we replace p with a sample of the same label. The replacement
sample is an IBP and is selected in exactly the same way as a seed is selected in G1.

Step 2: If we find a CBP p that violates (C2), which requires that $\|p - q\| \geq f\delta_n$ for any
other prototype q, we replace p with an IBP of the same label. We perform this
operation iteratively, until the desired separations hold between CBPs, and between
CBPs and IBPs.


We now prove that after these two steps, all prototypes satisfy (C1) and (C2). We first prove
that, after Step 1, all prototypes satisfy (C1). By assumption, all CBPs satisfy (C1) after this step.
Also, all IBPs satisfy (C1), since all heterogeneous samples are $\delta_n$-separated from them and
$\delta_n \geq f\delta_n$. We now prove that, after Step 2, all prototypes satisfy both (C1) and (C2). It
is clear that all prototypes satisfy (C1) at the end of this step; and each CBP maintains
$f\delta_n$-separations from other prototypes by assumption. Also, since each IBP maintains
$\delta_n$-separations from other IBPs and $\delta_n \geq f\delta_n$, each IBP maintains
$f\delta_n$-separations from other IBPs.

Prototype Adjustment in H3. This module adjusts prototypes in two steps. 

Step I: A set of prototypes of the same label is called a pack. When a pack consists of CBPs
that satisfy (C1) and they are $(1 - \rho) f\delta_n$-separated from each other, as required by
(C3), we preserve that pack. Otherwise, we replace it with the set of seeds from which
the CBPs were derived.

Step II: Two packs are said to be $f\delta_n$-separated if any two prototypes drawn from them are
$f\delta_n$-separated. When we find two packs that are not $f\delta_n$-separated, we replace one of
them with the set of seeds from which its prototypes were derived. We perform this
operation iteratively until the remaining packs are $f\delta_n$-separated.

We now show that, after Step I, all prototypes satisfy (C1) and (C3). For convenience, we call
preserved packs P-packs and replacement packs R-packs. An R-pack consists of existing
prototypes, called X-prototypes, and an unabsorbed sample, called a U-prototype. By induction,
all X-prototypes meet (C1) and (C3). The U-prototype, denoted as u, also satisfies (C1), because
heterogeneous samples are $\delta_n$-separated from each other. It remains to show that u is
$(1 - \rho) f\delta_n$-separated from all X-prototypes. This fact follows from a similar argument to
that for Lemma 10(c), so we omit the proof. We conclude that all the prototypes satisfy (C1) and (C3).

We now prove that, after Step II, all prototypes satisfy (C1), (C2), and (C3). It is clear that all
prototypes satisfy (C1) and (C3) at the end of this step. It remains to prove that they also satisfy
(C2), that is, heterogeneous prototypes are $f\delta_n$-separated. We want to show that all
heterogeneous prototypes in the R-packs are $f\delta_n$-separated. As noted earlier, these prototypes
consist of X-prototypes and U-prototypes. By induction, heterogeneous X-prototypes are
$f\delta_n$-separated. Heterogeneous U-prototypes are also $f\delta_n$-separated, as noted before. U-prototypes are
also $f\delta_n$-separated from heterogeneous X-prototypes, since all X-prototypes satisfy (C1). Thus,
at the end of Step II, all prototypes satisfy (C2).

The $f\delta_n$-separation between heterogeneous prototypes and the $(1-\rho)f\delta_n$-separation between
homogeneous prototypes imply the convergence of LAPL and also its consistency under the
conditions of boundedness and non-zero separation. Note that the above conclusions do not hold
for the case where f = 0, which we deal with in Section 5.


Theorem 13 The LAPL terminates within a finite number of iterations, provided that $f \in (0, 1]$,
$\rho \in [0, 1)$, and $m \in (1, \infty)$.


Theorem 14 Let $\{g_n\}$ be a sequence of classifiers using the 1-NN rule based on LAPL prototype
data pairs. The conditions of boundedness and non-zero separation ensure the consistency of
$\{g_n\}$, provided that $f \in (0, 1]$, $\rho \in [0, 1)$, and $m \in (1, \infty)$.


5 Kernelized Adaptive Prototype Learning Algorithms


Let $\Phi: R^d \to H$ be a function that maps from the d-dimensional Euclidean space to a Hilbert
space H, whose dimension dim(H) may be infinite. In a kernelized adaptive prototype learning
algorithm, the goal is to build prototypes in H. To this end, we first transform the given observed
data $\{x_j\}_{j=1}^{n}$ into $\{\Phi(x_j)\}_{j=1}^{n}$. When either KM or FCM is used to compute prototypes, each
prototype in H is of the form

$$c_i = \frac{\sum_{j=1}^{n} u_{ij}^m \Phi(x_j)}{\sum_{j=1}^{n} u_{ij}^m},$$

where $c_i$ and $u_{ij}$ were introduced in (6), (7), and (8). When KM is used, m = 1. Moreover, $u_{ij} = 1/n_i$
provided that the j-th sample falls in the i-th cluster, whose population size is $n_i$; otherwise, $u_{ij} = 0$.
When FCM is used, we compute $u_{ij}$ according to (9), in which the distance now becomes a kernel-
based distance, to be defined below.


If dim(H) = $\infty$, $c_i$ cannot be expressed in vector form. Even when dim(H) < $\infty$, it can be
computationally expensive to find an explicit form of $c_i$. Fortunately, we can compute the
distance between $\Phi(x_j)$ and $c_i$ directly, provided there exists a kernel function (Mercer, 1909;
Girosi, 1998)

$$K(x, y) = \langle \Phi(x), \Phi(y) \rangle, \quad \text{for } x, y \in R^d.$$


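When such a function exists, we obtain the kernel-based distance as

$$\|c_i - \Phi(x_j)\|^2 = \langle c_i - \Phi(x_j), c_i - \Phi(x_j) \rangle = \langle c_i, c_i \rangle - 2\langle c_i, \Phi(x_j) \rangle + \langle \Phi(x_j), \Phi(x_j) \rangle. \tag{11}$$

Moreover,

$$\langle c_i, c_i \rangle = \frac{\sum_{k=1}^{n}\sum_{l=1}^{n} u_{ik}^m u_{il}^m K(x_k, x_l)}{\left(\sum_{k=1}^{n} u_{ik}^m\right)^2}, \tag{12}$$

$$\langle c_i, \Phi(x_j) \rangle = \frac{\sum_{k=1}^{n} u_{ik}^m K(x_k, x_j)}{\sum_{k=1}^{n} u_{ik}^m}, \tag{13}$$

and

$$\langle \Phi(x_j), \Phi(x_j) \rangle = K(x_j, x_j). \tag{14}$$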
From $\{\|c_i - \Phi(x_j)\|^2\}_{i,j}$ we derive $\{u_{ij}\}$ according to the appropriate formula in the
clustering algorithm being used. Since we do not want to express the prototypes explicitly, we use
$\{u_{ij}\}$ to represent them instead. From the prototypes, we can compute $\{\|c_i - \Phi(x_j)\|^2\}_{i,j}$
again, using (11)-(14). This iterative process stops if the number of iterations reaches 30, or

$$\max_{i,j} |u_{ij}^{old} - u_{ij}^{new}| < 0.001.$$

These steps represent the kernelized versions of KM and FCM,
depending on which definition of $\{u_{ij}\}$ is used. The kernelized version of FCM was proposed
and studied by Wu, Xie and Yu (2003) and Kim et al. (2005).
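
As an illustration only, the distance computation in (11)-(14) can be sketched in Python with NumPy. The function names, the use of a precomputed Gram matrix K, and the membership matrix U of shape (number of prototypes, n) are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """Gram matrix with entries exp(-gamma * ||x_k - x_l||^2)."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X * X, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def kernel_sq_distances(K, U, m=1.0):
    """Return D with D[i, j] = ||c_i - Phi(x_j)||^2, following (11)-(14).

    K is the (n, n) Gram matrix, U holds the memberships u_ij, and m is the
    exponent applied to the memberships (m = 1 for KM).
    """
    K = np.asarray(K, dtype=float)
    W = np.asarray(U, dtype=float) ** m            # u_ij^m
    W = W / W.sum(axis=1, keepdims=True)           # divide by sum_k u_ik^m
    cc = np.einsum("ik,kl,il->i", W, K, W)         # <c_i, c_i>, eq. (12)
    cx = W @ K                                     # <c_i, Phi(x_j)>, eq. (13)
    xx = np.diag(K)                                # <Phi(x_j), Phi(x_j)>, eq. (14)
    return cc[:, None] - 2.0 * cx + xx[None, :]    # eq. (11)
```

With a linear kernel (K = X Xᵀ) the result reduces to ordinary squared Euclidean distances to the weighted means, which gives a convenient sanity check.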

There is also a kernelized version of GCNN, but we do not consider it in this paper, for the
reason to be given in Section 7. We denote the kernelized version of LAPL simply as KAPL,
which is derived from LAPL by replacing KM or FCM with an appropriate kernelized version. In
addition, we make the following changes. First, the initial y-prototype in KAPL should be the
average of all $\{\Phi(x) : \ell(x) = y\}$. Using (11)-(14), we compute the distance of each $\Phi(x)$ to this
prototype. Second, we apply the prototype adjustment module in KAPL to separate prototypes.
Prototype separation, however, does not imply the convergence of KAPL, since it may have to
deal with data in a space of infinite dimensions.

To ensure the convergence of KAPL, we modify the prototype adjustment module as follows.
As in LAPL, we adopt the necessary operations to create the desired prototype separation.
However, prior to these operations we check if each prototype in a pack has a non-empty domain
of attraction (DOA), where the DOA of a y-prototype p is the set of all y-samples $\Phi(x)$ for which
p is the nearest prototype. Recall that we employ a clustering algorithm to create a pack of
prototypes, using an unabsorbed sample $\Phi(u)$ and some other prototypes as seeds. If any
prototype in a pack has an empty DOA, we replace that pack with the pack of prototypes
constructed earlier. In this case, $\Phi(u)$ is called a futile sample. If a sample is declared futile in an
iteration, it will not be taken as a sample in any later iteration.
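
The DOA check can be sketched as follows. This is a minimal illustration, assuming a precomputed matrix D of prototype-to-sample distances (for KAPL, the kernel-based distances) and assuming that "nearest prototype" is taken over all prototypes regardless of label; the function names are hypothetical.

```python
import numpy as np

def domains_of_attraction(D, sample_labels, proto_labels):
    """D[i, j] is the distance from prototype i to sample j.

    Returns one index array per prototype: the samples that carry the
    prototype's label and have that prototype as their nearest prototype.
    """
    D = np.asarray(D, dtype=float)
    sample_labels = np.asarray(sample_labels)
    nearest = np.argmin(D, axis=0)  # nearest prototype of each sample
    doas = []
    for i, y in enumerate(proto_labels):
        doas.append(np.where((nearest == i) & (sample_labels == y))[0])
    return doas

def pack_has_empty_doa(doas, pack):
    """True if any prototype index listed in `pack` has an empty DOA."""
    return any(len(doas[i]) == 0 for i in pack)
```

Under these assumptions, a pack for which `pack_has_empty_doa` returns True would be rolled back to the previously constructed pack, and the unabsorbed sample that triggered it would be marked futile.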


Theorem 12 The KAPL algorithm converges within a finite number of iterations.
Proof: The number of futile samples is bounded from above, since it cannot exceed n, that is, the
number of samples. We assume that the last futile sample is created at iteration i, with i < n. If all
samples are absorbed at the end of i, the proof is complete; otherwise, more prototypes will be
created, all with non-empty DOAs. The number of unabsorbed samples must decrease to zero, or
else the number of DOAs would eventually exceed the number of samples, which would be an
absurd result. □

Note that if we treat futile samples in LAPL in the same way, we can prove the convergence 
of LAPL in the setting where there is no guarantee of prototype separation. 


Theorem 13 Adopting the prototype adjustment module used in KAPL, LAPL converges for
f = 0 and $m \in (1, \infty)$.


6 Soft Adaptive Prototype Learning Algorithms 


The versions of APL proposed thus far are designed to continue constructing prototypes until all
training samples are absorbed or, equivalently, the training error declines to zero. These could be
called hard versions of APL. Insistence on a zero training error, however, runs the risk of
overfitting. Another approach, called the soft alternative, maintains the error rate at a level that
enhances the generalization power of the resultant classifier. The optimal error rate can be
determined in a cross-validation task, which is also needed to find the optimal values of the
parameters. All versions of APL involve some parameters; for example, they all involve f and ρ,
which regulate prototype separation (cf. (2), (5), (C1), (C2), and (C3)). Moreover, if FCM is used
to compute cluster centers, there is another parameter m (cf. (7), (9), and (10)) to consider. In
addition, some parameters in KAPL are used to define the kernel-based distance. For example,
when the RBF kernel

$$K(x, y) = \exp(-\gamma \|x - y\|^2) \tag{15}$$

is used to define the distance, there is an additional parameter γ, whose range is assumed to be
$(0, \infty)$.


To search for the optimal values of the parameters, we perform cross-validation. As all the
parameters are assumed to be independent, we must evaluate all combinations of them and
determine which one is the most suitable for the task. When a combination of parameter values Q
is given, we build prototypes on K-1 folds of data, which serve as the training data, and measure
the test accuracy on the remaining fold of data, which serves as the validation data. We determine
the optimal training error rate associated with Q as follows.

Given a set of training data and a set of validation data and assuming that the latter is the k-th
fold of the data, k = 1, 2, ..., K, we construct prototypes and record the following information.
First, for a given level of e, we record the lowest number of iterations $n_k(e, Q)$ at which the
training error rate falls below e. We also compute the validation accuracy rate $v_k(e, Q)$ for all the
prototypes obtained at the end of iteration $n_k(e, Q)$. Let $v(e, Q) = \sum_{k=1}^{K} v_k(e, Q)/K$. The optimal
training error rate is then

$$e_{opt}(Q) = \arg\max_{e} v(e, Q).$$

Note that once we have constructed prototypes to achieve a training error $e_1$, we do not need
to start from scratch to obtain a lower training error $e_2$. Instead, we continue to construct more
prototypes until $e_2$ is reached. At the end of this process, we obtain v(e, Q) for all e and thus
$v(e_{opt}(Q), Q)$. When we have done this for all Q, we obtain the optimal Q as

$$Q_{opt} = \arg\max_{Q} v(e_{opt}(Q), Q).$$
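
The two-level selection can be sketched as below. The nested dictionary `v_folds`, which maps each parameter combination Q and each error level e to the list of K fold accuracies, is a hypothetical data layout chosen for this illustration.

```python
def select_optimal(v_folds):
    """v_folds[Q][e] is the list of fold accuracies v_k(e, Q), k = 1..K.

    Returns (e_opt, best), where e_opt[Q] is the error level maximizing the
    mean validation accuracy for Q, and best = (Q, e, accuracy) is the
    overall maximizer.
    """
    e_opt = {}
    best = None
    for Q, by_e in v_folds.items():
        # v(e, Q): average of the K fold accuracies
        mean_acc = {e: sum(vs) / len(vs) for e, vs in by_e.items()}
        e_best = max(mean_acc, key=mean_acc.get)
        e_opt[Q] = e_best
        if best is None or mean_acc[e_best] > best[2]:
            best = (Q, e_best, mean_acc[e_best])
    return e_opt, best
```

Ties are broken here by taking the first maximizer encountered; the text does not specify a tie-breaking rule, so this is an arbitrary choice of the sketch.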


One additional parameter that needs to be optimized is k, the number of nearest prototypes
that we use in a voting mechanism to determine the label of a test sample. If a tie occurs, we
classify the sample according to the nearest prototype. The optimal value of k should be evaluated
in the same cross-validation that is applied to the other parameters.
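
A minimal sketch of the voting rule with the tie fallback is given below. It assumes Euclidean distance and prototypes stored as explicit vectors, and it treats a tie as equal top vote counts among the k nearest prototypes; these details and the function name are assumptions of the illustration.

```python
from collections import Counter

import numpy as np

def classify(x, prototypes, labels, k):
    """Label x by majority vote among its k nearest prototypes.

    If the two most frequent labels receive the same number of votes,
    fall back to the label of the single nearest prototype.
    """
    P = np.asarray(prototypes, dtype=float)
    d = np.linalg.norm(P - np.asarray(x, dtype=float), axis=1)
    order = np.argsort(d, kind="stable")[:k]
    ranked = Counter(labels[i] for i in order).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return labels[order[0]]  # tie: use the nearest prototype
    return ranked[0][0]
```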


7 Experimental Results 


To evaluate the APL algorithms and compare their performance with that of alternative methods,
we use 12 benchmark data sets retrieved from the UCI databases (Newman et al., 1998). The
results are described in three subsections. The first describes the four types of APL. In the second
subsection, we compare the performance of GCNN with six instance-based prototype algorithms
proposed in the literature. Then, in the third subsection, we compare the performance of the four
APLs with SVM and k-NN. Note that many of the methods, including ours, require that the data
be bounded. One way to meet this requirement is to normalize all the feature values to
[0, 255], which can be done by the following linear transformation:

$$x' = \frac{(x - v) \times 255}{V - v},$$

where x is a given feature value, V is the maximum value of the feature, and v is the minimum
value. All experimental results reported in this section were obtained using a 3.4 GHz Intel
Pentium 4 CPU with 1 GB of RAM.
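
The normalization can be sketched per feature column as follows. The handling of constant features (where V = v), which the text does not address, is an assumption of this sketch.

```python
import numpy as np

def normalize_features(X, upper=255.0):
    """Map each feature column of X linearly onto [0, upper]."""
    X = np.asarray(X, dtype=float)
    v = X.min(axis=0)          # per-feature minimum
    V = X.max(axis=0)          # per-feature maximum
    span = V - v
    span[span == 0] = 1.0      # assumption: constant features map to 0
    return (X - v) * upper / span
```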


7.1 Evaluation of APLs 

The four types of APL are listed in Table 1. The first one is GCNN. The other three types of APL
are: fuzzy linear APL (f-LAPL), crisp kernelized APL (c-KAPL), and fuzzy kernelized APL
(f-KAPL). We use “f-” to indicate that the clustering algorithm employed is FCM, and “c-” to
indicate that the technique is KM. In the experiments, the soft versions of the four APLs are used.
Although we could consider the kernelized version of GCNN using RBF as the kernel function, this
version of GCNN gains only slightly higher testing accuracy than GCNN, at the expense of a much
higher number of prototypes. So we choose not to discuss it. We do not discuss c-LAPL
either, since it usually performs worse than f-LAPL.


In Table 2, we show the parameters used in the four types of APL and also the values of the
parameters whose combinations are considered in our experiments. The values result from a
trade-off between the demand for accuracy and the need to reduce the computation time. When a
combination, Q, of parameter values is given, we have to record v(e, Q) for certain values of e. In
our experiments, the values of e, at which we record v(e, Q), are percentages that start from 0%
and increase by some increments until they reach 30%. All the percentages are listed in Table 3.
The 12 benchmark data sets retrieved from the UCI databases are listed in Table 4, which also
shows the number of labels, the number of samples, the number of features per sample, and the
number of folds into which we divide the samples during cross-validation.














APL Type | Assumed Distance
GCNN     | Euclidean
f-LAPL   | Euclidean
c-KAPL   | RBF
f-KAPL   | RBF

Table 1. The four types of APL studied in our experiments.





Parameter | Values                                            | GCNN | f-LAPL | c-KAPL | f-KAPL
f         | 0, .1, .25, .5, .75, 1                            |      |        |        |
ρ         | 0, .1, .25, .5, .75, .99                          | ✓    | ✓      | ✓      | ✓
m         | 1.05, 1.1, 1.2, 1.3, 1.4                          |      | ✓      |        | ✓
γ         | a × 10^(-b); a = 1, 2, ..., 9; b = 4, 5, ..., 7   |      |        | ✓      | ✓

Table 2. Parameters: their value range, and the types of APL that involve them; “✓” indicates
that the parameter is used in that type of APL. The parameters f and ρ appear in (2),
(5), (C1), (C2), and (C3); m appears in (7); and γ appears in (15).





Values of e
0%, 1%, 2%, 3%, 4%, 5%, 7.5%, 10%, 20%

Table 3. The values of e at which we record v(e, Q).
Data Set                    | Number of Labels | Number of Samples | Number of Features | Number of Folds
Iris                        | 3 | 150 | 4  | 5
Wine                        | 3 | 178 | 13 | 5
Glass                       | 6 | 214 | 9  | 5
Ionosphere                  | 2 | 351 | 34 | 10
Cancer                      | 2 | 683 | 9  | 10
Zoo                         | 7 | 101 | 16 | 5
Heart                       | 2 | 270 | 13 | 5
TAE                         | 3 | 151 | 5  | 5
BUPA Liver Disorders (BLD)  | 2 | 345 | 6  | 5
New Thyroid                 | 3 | 215 | 5  | 5
SPECTF                      | 2 | 267 | 44 | 5
Ecoli                       | 8 | 336 | 7  | 5

Table 4. Information contained in the 12 data sets.


In Table 5, we show three performance measures of the four APLs, namely, the accuracy rate,
the training time, and the condensation ratio. Given that K-fold cross-validation is conducted, the
accuracy rate (AR) is the average accuracy over all validation data sets, each of which is one of
the K folds; the training time (TT) is the sum of the training times of all training data sets, each of
which consists of K-1 folds; and the condensation ratio (CR) is the average of the prototype-to-sample
ratios obtained from all training data. Note that for most types of APL, we drop the decimal parts
of their training times, since they are insignificant relative to the integer parts. At the bottom of
Table 5, we also show the average of the three measures over the 12 data sets. The boldface
figures indicate that the performance of the corresponding method is the best of all the methods
applied to the given data set.

The averaged figures in Table 5 show that, in terms of training time, the four APLs are
ranked in the following order: GCNN, f-LAPL, c-KAPL, and f-KAPL. The number of all possible
combinations of parameter values is the major factor that affects the amount of training time. If
we divide the total training time by the above number, then the temporal differences among the
four algorithms are reduced drastically, as shown in Table 6. Since APL training under different
combinations of parameter values is conducted independently, some form of parallel
computing, such as cluster computing or grid computing, would help reduce the training time.

GCNN requires the least amount of training time because it picks samples as prototypes,
thereby avoiding the rather costly computation of clustering. The c-KAPL and f-KAPL
algorithms, on the other hand, employ kernelized versions of KM and FCM respectively, which
are relatively slow. In terms of accuracy, the order of the four APLs is exactly the opposite of that
for the training time.
Data Set    | Measure | GCNN  | f-LAPL | c-KAPL  | f-KAPL
Iris        | AR      | 96.62 | 97.95  | 98.63   | 98.40
            | TT      | 0.8   | 30     | 16,225  | 81,334
            | CR      | 9.6   | 10.33  | 54.00   | 5.83
Wine        | AR      | 98.06 | 99.02  | 99.02   | 99.56
            | TT      | 1     | 144    | 24,378  | 175,647
            | CR      | 21.7  | 18.40  | 20.22   | 92.98
Glass       | AR      | 69.39 | 71.26  | 72.23   | 72.73
            | TT      | 1.37  | 314    | 18,906  | 108,891
            | CR      | 48.5  | 35.98  | 44.98   | 22.90
Ionosphere  | AR      | 89.07 | 91.46  | 95.88   | 95.87
            | TT      | 9.45  | 8,010  | 399,693 | 3,078,420
            | CR      | 17.5  | 5.63   | 4.56    | 6.05
Cancer      | AR      | 97.5  | 97.79  | 97.50   | 97.79
            | TT      | 3.34  | 2,301  | 496,817 | 5,265,013
            | CR      | 17.9  | 4.44   | 12.74   | 19.70
Zoo         | AR      | 97.66 | 97.66  | 97.66   | 97.66
            | TT      | 0.83  | 11     | 21,666  | 135,346
            | CR      | 23.2  | 18.32  | 22.77   | 24.50
Heart       | AR      | 85.57 | 86.90  | 85.83   | 86.43
            | TT      | 1.56  | 1,134  | 72,925  | 607,436
            | CR      | 42.6  | 21.67  | 35.83   | 23.98
TAE         | AR      | 63.21 | 62.47  | 65.22   | 65.61
            | TT      | 0.95  | 229    | 18,682  | 133,157
            | CR      | 43.2  | 51.82  | 45.86   | 46.85
BLD         | AR      | 65.93 | 67.34  | 67.72   | 70.52
            | TT      | 2.4   | 3,379  | 232,211 | 1,378,124
            | CR      | 47.9  | 35.87  | 74.13   | 23.33
New Thyroid | AR      | 97.31 | 97.76  | 98.57   | 99.05
            | TT      | 0.92  | 135    | 19,289  | 134,671
            | CR      | 5.7   | 12.79  | 3.72    | 10.00
SPECTF      | AR      | 83.55 | 85.63  | 86.13   | 87.04
            | TT      | 9.1   | 8,820  | 167,428 | 1,363,339
            | CR      | 28.3  | 30.15  | 50.37   | 28.37
Ecoli       | AR      | 86.44 | 86.81  | 86.18   | 87.06
            | TT      | 1.2   | 920    | 35,151  | 216,235
            | CR      | 24.6  | 31.85  | 48.51   | 27.98
AVERAGE     | AR      | 85.86 | 86.84  | 87.55   | 88.14
            | TT      | 2.3   | 2,119  | 126,948 | 1,056,468
            | CR      | 27.5  | 23.10  | 34.81   | 27.21

Table 5. The performance of the four APLs, where AR = Accuracy Rate (%), TT = Training
Time (sec), and CR = Condensation Ratio (%).
APL Type | Number of Combinations | Total Training Time | Training Time per Combination
GCNN     | 6                      | 2.3                 | 0.38
f-LAPL   | 155                    | 2,119               | 13.67
c-KAPL   | 1,116                  | 126,948             | 113.75
f-KAPL   | 5,580                  | 1,056,468           | 189.33

Table 6. The number of all possible combinations of parameter values, the total training time,
and the training time per combination for the four types of APL.
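
The counts in Table 6 can be reproduced by a short enumeration. The text does not state how the counts arise; the sketch below assumes that when f = 0 the value of ρ has no effect, so that all (0, ρ) pairs collapse into a single combination, and that GCNN is governed by a single six-valued parameter. Both are assumptions of this illustration, adopted because they reproduce the reported numbers.

```python
from itertools import product

F = [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]
RHO = [0.0, 0.1, 0.25, 0.5, 0.75, 0.99]
M = [1.05, 1.1, 1.2, 1.3, 1.4]
# gamma is indexed by (a, b) with a = 1..9 and b = 4..7 (36 values)
GAMMA = [(a, b) for a in range(1, 10) for b in range(4, 8)]

def separation_pairs():
    """(f, rho) pairs; assumption: f = 0 makes rho irrelevant."""
    pairs = [(0.0, None)]
    pairs += [(f, rho) for f in F if f > 0 for rho in RHO]
    return pairs

def combinations(use_m, use_gamma):
    axes = [separation_pairs(),
            M if use_m else [None],
            GAMMA if use_gamma else [None]]
    return list(product(*axes))

counts = {
    "GCNN": len(F),  # assumption: one six-valued parameter
    "f-LAPL": len(combinations(use_m=True, use_gamma=False)),
    "c-KAPL": len(combinations(use_m=False, use_gamma=True)),
    "f-KAPL": len(combinations(use_m=True, use_gamma=True)),
}
total_time = {"GCNN": 2.3, "f-LAPL": 2119, "c-KAPL": 126948, "f-KAPL": 1056468}
per_combination = {name: round(total_time[name] / counts[name], 2) for name in counts}
```

Because the combinations are independent, the list returned by `combinations` could be split across workers in the parallel setting mentioned above.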





        | Time to Compute $\delta_n$ (sec) | GCNN Run Time (sec) | Ratio (%)
Average | 0.012                            | 0.38                | 3.2

Table 7. The average amount of time to compute $\delta_n$, the average run time of GCNN, and their
ratio.


These findings suggest that the high accuracy rates of APLs are achieved at the expense of a
rather high computational cost. Hence, there is a tradeoff between accuracy rates and training
costs, which allows users to choose the most suitable APL based on the size of their problems,
their computing resources, and the degree of accuracy they require. There are two reasons for this
tradeoff. First, the cluster-based approach has higher generalization power than the instance-
based approach, since it picks the weighted averages of samples as prototypes and they are
relatively immune to noise. Second, the RBF-based approach has higher generalization power
than the Euclidean-based approach. To understand why this is so, we note that for very small γ,
the RBF distance between x and y is approximately $2\gamma\|x - y\|^2$. This means that the RBF
distance covers the Euclidean distance as a special case, and using the RBF distance may allow us
to find a better-performing classifier than the one we obtain by using the Euclidean distance.
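
The small-γ approximation can be checked directly from (11), (14), and (15), taking the kernel-based distance in its squared form and using a first-order expansion of the exponential:

$$\|\Phi(x) - \Phi(y)\|^2 = K(x, x) - 2K(x, y) + K(y, y) = 2 - 2\exp(-\gamma\|x - y\|^2) = 2\gamma\|x - y\|^2 + O(\gamma^2\|x - y\|^4).$$

Since the data are bounded, the remainder term is uniformly small when γ is small, so the ordering of RBF distances then agrees with the ordering of Euclidean distances.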
Recall that when applying any APL algorithm we must first compute $\delta_n$, the minimum
distance between heterogeneous samples. One may be curious about the ratio of the computing
time for $\delta_n$ to the run time of APL. In fact, the ratio is 3.2% for GCNN (Table 7) and much less
for the other types of APL.

The reason for such a small ratio is as follows. If the number of training samples is n, then the
time complexity of computing $\delta_n$ is in the order of $n^2$, while the time complexity of conducting
APL training is in the order of $n^3$. To confirm the latter fact, we note that APL training takes no
more than n iterations. Within each iteration, checking the absorption criterion takes no more than
$n^2$ steps, and clustering takes no more than $30n^2$ steps (if cluster-based prototypes are required),
where 30 is the maximum number of iterations allowed in a clustering algorithm. Furthermore,
the space complexity of APL training is in the order of $n^2$ at most.
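
The quadratic-time computation of $\delta_n$ can be sketched as below, assuming Euclidean distance. The function name is hypothetical, and the vectorized form trades memory (an n × n × d intermediate array) for brevity.

```python
import numpy as np

def min_heterogeneous_distance(X, y):
    """Minimum Euclidean distance between samples with different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    diff = X[:, None, :] - X[None, :, :]       # all pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=2))    # n x n distance matrix
    hetero = y[:, None] != y[None, :]          # pairs with different labels
    if not hetero.any():
        raise ValueError("at least two distinct labels are required")
    return dist[hetero].min()
```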

LAPL and KAPL are associated with parameters f and ρ, which appear in the absorption
criterion (5) and requirements (C1), (C2), and (C3) (cf. Section 4). We were curious to know how
the parameters’ values affect the prototypes built in the training process, so we studied the
training of f-LAPL on the 12 data sets. We assume that all parameters except f are fixed at
certain values. The absorption criterion requires that a training sample be closer to its
nearest homogeneous prototype than to its nearest heterogeneous prototype by at least $f\rho\delta_n$. If we raise
the value of f, we increase the likelihood of a sample becoming unabsorbed so that more
prototypes would have to be built. This fact is reflected in Table 8, which shows that the average
condensation ratio increases as the value of f increases. What happens when we fix the values of
all parameters except ρ? By raising the value of ρ, we also make the absorption criterion more
difficult to satisfy and therefore increase the number of prototypes that need to be built. This fact
is reflected in Table 9.





f                              | 0.00  | 0.10  | 0.25  | 0.50  | 0.75  | 1.00
Average Condensation Ratio (%) | 22.18 | 23.26 | 24.56 | 27.83 | 30.70 | 34.92

Table 8. Average condensation ratio of f-LAPL over the 12 data sets for various values of f
when m = 1.1, ρ = 0.5, and e = 0.





p 0.00 | 0.10 | 0.25 | 0.50 | 0.75 | 0.99 
———————————————————a— 
Average Condensation Ratio (%) | 22.25 | 23.22 | 24.53 | 27.83 | 30.81 | 34.31

Table 9. Average condensation ratio of f-LAPL over the 12 data sets for various values of ρ
when m = 1.1, f = 0.5, and e = 0.


7.2 Comparison of GCNN with Some Instance-Based Learning Algorithms 


As noted earlier, GCNN differs from LAPL and KAPL in that it adopts samples as prototypes. It 
is thus one of the methods, called instance-based learning algorithms, which reduce an entire set 
of training samples to a subset, while maintaining as much generalization power as possible. For 
this reason, we compare GCNN with some of the methods that have been proposed in the 
literature. 


Two approaches can be adopted in instance-based learning (IBL) algorithms. The first is
incremental, so it starts with a null set and gradually adds samples as prototypes. Both CNN and
GCNN are incremental algorithms. For comparison purposes, we also include a primitive version
of GCNN, called pGCNN. It is similar to GCNN, except that the value of parameter f is fixed at 0.
Note that pGCNN is not the same as CNN. In pGCNN, we select unabsorbed samples through a
voting procedure (cf. Section 4) and the training error rate e is determined by cross-validation (cf.
Section 6). In CNN, however, unabsorbed samples are selected randomly and e is fixed at 0.

The second approach is decremental, so it starts with the entire set of samples and gradually
removes samples that are considered properly “protected” by the retained ones. For algorithms of
this type, we include DROP1 to DROP5 (Wilson and Martinez, 2005) and ICF (Brighton and
Mellish, 2002) for comparison. They differ from each other in the way samples are ordered for
removal, and in the criterion for removing samples. For further details, readers should refer to the
cited references. We used the code provided by Wilson and Martinez (2005) for DROP1 to
DROP5, and implemented our own code for ICF.

For all the methods, we apply cross-validation, similar to that used for the APLs, whereby the
12 data sets are divided into the same number of folds (cf. Table 4). Moreover, in measuring the
test accuracy, we use the k nearest prototypes, with k being determined in the cross-validation
(cf. Section 6). Table 10 shows the performance of all the instance-based methods, with the
averaged results shown at the bottom of the table. From the latter results, we observe that GCNN
achieves the best accuracy among all the compared methods. In general, the incremental methods
have lower training costs than the decremental methods. The only exception is GCNN, which is
a little slower than ICF. On the other hand, the incremental methods build more prototypes than
the decremental methods. Among the incremental methods, GCNN achieves a higher accuracy
rate than the other two methods, at the expense of building more prototypes and a higher training
cost. Meanwhile, pGCNN constructs fewer prototypes and has a lower training cost than GCNN,
and yields higher accuracy and generates fewer prototypes than CNN. Both GCNN and pGCNN
generate fewer prototypes than CNN, because their training error rate e can be non-zero, while it
is fixed at zero for CNN.

            |         | Incremental Methods   | Decremental Methods
Data Set    | Measure | CNN   | pGCNN | GCNN  | DROP1 | DROP2 | DROP3 | DROP4 | DROP5 | ICF
Iris        | AR      |       |       |       |       | 95.71 | 95.23 | 95.23 | 95.53 | 95.04
            | TT      |       |       |       |       | 0.19  | 0.16  | 0.16  | 0.14  | 0.11
            | CR      |       |       |       |       | 9.2   | 10.5  | 10.5  | 8.9   | 22
Wine        | AR      |       |       |       |       | 93.15 | 93.42 | 93.42 | 98.06 | 92.81
            | TT      |       |       |       |       | 0.41  | 0.36  | 0.36  | 0.56  | 0.13
            | CR      |       |       |       |       | 12.6  | 12.1  | 12.1  | 8.1   | 11.1
Glass       | AR      |       |       |       |       | 65.83 | 66.23 | 67.12 | 64.19 | 64.09
            | TT      |       |       |       |       | 0.33  | 0.41  | 0.39  | 0.53  | 0.17
            | CR      |       |       |       |       | 27.1  | 18.8  | 24.3  | 23.5  | 22.3
Ionosphere  | AR      |       |       |       |       | 88.16 | 86.70 | 88.06 | 88.06 | 81.20
            | TT      |       |       |       |       | 6.6   | 8.18  | 7.8   | 13.98 | 0.48
            | CR      |       |       |       |       | 10.1  | 5.3   | 8.2   | 9     | 3.7
Cancer      | AR      |       |       |       |       | 96.47 | 96.03 | 96.47 | 96.32 | 96.47
            | TT      |       |       |       |       | 34.2  | 34.8  | 30.2  | 21.9  | 0.67
            | CR      |       |       |       |       | 5     | 3     | 3.7   | 3.9   | 2.5
Zoo         | AR      |       |       |       |       | 93.46 | 92.53 | 93.82 | 90.43 | 90.59
            | TT      |       |       |       |       | 0.28  | 0.25  | 0.25  | 0.28  | 0.16
            | CR      |       |       |       |       | 17    | 18.4  | 18.8  | 15    | 44.3
Heart       | AR      |       |       |       |       | 82.10 | 79.73 | 80.31 | 81.09 | 76.41
            | TT      |       |       |       |       | 0.97  | 1.09  | 1     | 1.2   | 0.14
            | CR      |       |       |       |       | 16.1  | 11.1  | 12.8  | 13.5  | 14.3
TAE         | AR      |       |       |       |       | 51.15 | 51.36 | 53.63 | 55.64 | 52.12
            | TT      |       |       |       |       | 0.09  | 0.11  | 0.11  | 0.07  | 0.11
            | CR      |       |       |       |       | 27.1  | 23    | 24.3  | 28.8  | 26.6
BLD         | AR      |       |       |       |       | 60.34 | 59.98 | 62.75 | 63.92 | 60.22
            | TT      |       |       |       |       | 0.5   | 0.63  | 0.55  | 0.7   | 0.12
            | CR      |       |       |       |       | 29.9  | 19.3  | 25.2  | 23.4  | 18
New Thyroid | AR      |       |       |       |       | 93.85 | 95.27 | 94.42 | 93.61 | 93.55
            | TT      |       |       |       |       | 0.36  | 0.33  | 0.33  | 0.45  | 0.14
            | CR      |       |       |       |       | 11.3  | 7.2   | 8.4   | 7.6   | 8
SPECTF      | AR      |       |       |       |       | 74.99 | 79.98 | 74.29 | 76.13 | 76.26
            | TT      |       |       |       |       | 3.17  | 3.28  | 2.9   | 3.86  | 0.25
            | CR      |       |       |       |       | 16.5  | 9.1   | 11.7  | 11.7  | 10.1
Ecoli       | AR      |       |       |       |       | 85.94 | 83.53 | 86.64 | 84.36 | 83.17
            | TT      |       |       |       |       | 0.84  | 1.44  | 1.28  | 1.36  | 0.25
            | CR      |       |       |       |       | 14.6  | 12.00 | 12.7  | 12.2  | 11.3
AVERAGE     | AR      | 83.51 | 84.47 | 85.86 | 79.02 | 81.76 | 81.67 | 82.18 | 82.28 | 80.16
            | TT      | 0.12  | 0.19  | 2.3   | 2.76  | 4     | 4.25  | 3.78  | 3.75  | 0.23
            | CR      | 32.01 | 21.2  | 24.5  | 11.9  | 16.4  | 12.5  | 14.4  | 13.8  | 16.2

Table 10. The performance of three incremental methods and six decremental methods.


Since pGCNN is a special case of GCNN with ρ = 0, comparison of their accuracy rates
offers us an opportunity to examine the sensitivity of GCNN to the parameter values. The
difference between the average accuracy rates is 1.39%, but for the Ecoli and Heart data sets, the
differences increase to 2.86% and 3.62% respectively, showing that the search for the optimal
parameter values can be very useful. A similar situation is found with other types of APL.


7.3 Comparison of LAPL and KAPL with k-NN and SVM


To further evaluate the performance of the four APLs, we run two alternative learning
methods: k-NN and SVM. Once again, for both methods, we apply cross-validation, similar to
that used for APLs. For SVM, we employ the soft-margin version with the RBF kernel. Recall
that the RBF function involves a parameter γ. In SVM, the value range of γ is taken as
$\{a \times 10^{-b}: a = 1, 2, ..., 9 \text{ and } b = 3, 4, ..., 6\}$, which differs from that of KAPL by a factor of 10. Also, since
the soft-margin version of SVM is used, there is an additional parameter C, which serves as a
penalty factor for SVM training errors and whose value range is taken as $\{10^c: c = -1, 0, ..., 5\}$. We
use the LIBSVM toolkit (Hsu and Lin, 2002) to train SVM. For k-NN, the optimal value of k is
determined during cross-validation, in much the same way that we optimize the k nearest
prototypes for use in the voting procedure to determine the label of a test sample (cf. Section 6).

One crucial difference between SVM and APL is the way of dealing with multiclass data sets,
that is, data sets comprising more than two class types. Since SVM only deals with one binary
classification at a time, we need to use a decomposition scheme when applying it to multiclass
data sets. We employ one-against-others (Bottou et al., 1994) in our experiment. In other words,
if there are m class types in total, we train m SVM classifiers, each of which classifies a sample as
A or not A, where A is one of the m class types. One-against-one (Knerr et al., 1990; Platt et al.,
2000) is an alternative decomposition scheme that requires us to train m(m-1)/2 classifiers. In our
experience, the one-against-others scheme usually yields comparable or better accuracy rates than
the one-against-one approach; however, the training cost is higher. For APLs, on the other hand,
we construct prototypes for all class types simultaneously. Thus, in our experiments, there is no
decomposition scheme for APLs.
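
The label bookkeeping behind the two decomposition schemes can be sketched as follows. The helper names and the +1/-1 encoding are assumptions of this illustration; training the binary classifiers themselves is outside its scope.

```python
def one_against_others_tasks(labels):
    """One binary relabeling per class: +1 for that class, -1 for the rest."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else -1 for y in labels] for c in classes}

def num_classifiers(m):
    """Number of binary classifiers needed for m class types."""
    return {"one_against_others": m, "one_against_one": m * (m - 1) // 2}
```

For the eight-class Ecoli data set, for example, `num_classifiers(8)` gives 8 classifiers under one-against-others and 28 under one-against-one.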

The accuracy rates and training times of all the methods are given in Table 11. The boldface
numbers have the same meaning as before, while the underlined numbers are the accuracy rates
that are lower than the corresponding SVM results. As usual, we list the averaged results over all
the 12 data sets at the bottom of the table. From the latter results, we observe that all the APLs
outperform k-NN in terms of accuracy, and that GCNN is faster in training than SVM but less
accurate. The other three APLs incur higher training costs than SVM, but yield higher accuracy
rates.


8 Conclusion 


We have proposed a number of adaptive prototype learning algorithms that construct prototypes
out of training samples. They differ in the use of samples or the weighted averages of samples as
prototypes, and in the use of the Euclidean distance or a kernel-based distance. The algorithms
can be further strengthened by allowing a non-zero training error rate, which improves the test
accuracy. Our experiments, in which four types of APL were applied to 12 benchmark data sets,
confirm the algorithms’ efficacy in terms of test accuracy compared to many instance-based
learning algorithms, the k-NN rule, and SVM.
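            |         | Proposed Methods                       | Alternative Methods
Data Set    | Measure | GCNN  | f-LAPL | c-KAPL  | f-KAPL    | k-NN  | SVM
Iris        | AR      | 96.62 | 97.95  | 98.63   | 98.40     | 97.03 | 96.47
            | TT      | 0.8   | 30     | 16,225  | 81,334    | -     | 70.52
Wine        | AR      | 98.06 | 99.02  | 99.02   | 99.56     | 97.64 | 98.97
            | TT      | 1     | 144    | 24,378  | 175,647   | -     | 90.88
Glass       | AR      | 69.39 | 71.26  | 72.23   | 72.73     | 70.40 | 69.43
            | TT      | 1.37  | 314    | 18,906  | 108,891   | -     | 299.68
Ionosphere  | AR      | 89.07 | 91.46  | 95.88   | 95.87     | 86.72 | 95.08
            | TT      | 9.45  | 8,010  | 399,693 | 3,078,420 | -     | 362.20
Cancer      | AR      | 97.5  | 97.79  | 97.50   | 97.79     | 96.91 | 97.06
            | TT      | 3.34  | 2,301  | 496,817 | 5,265,013 | -     | 321.92
Zoo         | AR      | 97.66 | 97.66  | 97.66   | 97.66     | 96.55 | 95.86
            | TT      | 0.83  | 11     | 21,666  | 135,346   | -     | 153.64
Heart       | AR      | 85.57 | 86.90  | 85.83   | 86.43     | 83.77 | 84.83
            | TT      | 1.56  | 1,134  | 72,925  | 607,436   | -     | 130.00
TAE         | AR      | 63.21 | 62.47  | 65.22   | 65.61     | 57.78 | 64.23
            | TT      | 0.95  | 229    | 18,682  | 133,157   | -     | 605.24
BLD         | AR      | 65.93 | 67.34  | 67.72   | 70.52     | 63.90 | 71.19
            | TT      | 2.4   | 3,379  | 232,211 | 1,378,124 | -     | 1181.68
New Thyroid | AR      | 97.31 | 97.76  | 98.57   | 99.05     | 96.31 | 97.78
            | TT      | 0.92  | 135    | 19,289  | 134,671   | -     | 78.80
SPECTF      | AR      | 83.55 | 85.63  | 86.13   | 87.04     | 79.90 | 81.48
            | TT      | 9.1   | 8,820  | 167,428 | 1,363,339 | -     | 143.88
Ecoli       | AR      | 86.44 | 86.81  | 86.18   | 87.06     | 87.33 | 88.13
            | TT      | 1.2   | 920    | 35,151  | 216,235   | -     | 421.88
AVERAGE     | AR      | 85.86 | 86.84  | 87.55   | 88.14     | 84.52 | 86.71
            | TT      | 2.3   | 2,119  | 126,948 | 1,056,468 | -     | 321.72

Table 11. The performance of the four APLs, k-NN and SVM.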